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Abstract 

The discovery of non-linear causal relationship under additive non-Gaussian noise models has attracted 
considerable attention recently because of their high flexibility. In this paper, we propose a novel causal 
inference algorithm called least-squares independence regression (LSIR). LSIR learns the additive noise model 
through the minimization of an estimator of the squared-loss mutual information between inputs and residuals. 
A notable advantage of LSIR over existing approaches is that tuning parameters such as the kernel width and 
the regularization parameter can be naturally optimized by cross-validation, allowing us to avoid overfitting 
in a data-dependent fashion. Through experiments with real-world datasets, we show that LSIR compares 
favorably with a state-of-the-art causal inference method. 



1 Introduction 

Learning causality fro m data is one of the important challenges in the artificial intelligence, statistics, and machine 
learning communities ([Pearl , l2000h . A traditional method of learning causal relationship from observational data 
is based on the linear- dependence Gaussian- noise model (jGeiger and Heckermanl 11994). However, the linear- 
Gaussian assumption is too restrictive and may not be fulfilled in practice. Recently, non-Gaussianity and 
non-linearity have been shown to be beneficial in causal i nference, allowing one to break symmetry between 
observed variables ( Shimizu et 

oD, 120061: iHover et all l2009h . Since then, much attention has bee n paid to the 
discovery of non-linear causal relationship through non-Gaussian noise models (|Mooii et aUl2009l) . 



In the framework of non-linear non-Gaussian causal inference, the relation between a cause X and an effect 
Y is assumed to be described by Y = f(X) + E, where / is a non-linear function and E is non-Gaussian 
additive noise which is independent of the cause X. Given two random variables X and X', the causal direction 
between X and X' is decided based on a hypothesis test of whether the model X' = f(X) + E or the alternative 
model X = f'(X') + E' fits the data wel l — here, the goodn ess of fit is measured by independence between 
inputs and residuals (i.e ., estimated n oise). IHover et al. (2009) proposed to learn the functions / and /' by the 
Gaussian process (GP) (iBishod - liool . and evaluat e the independence b etween the inputs and the residuals by 



the Hilbert- Schmidt independence criterion (HSIC) ( Gretton et al. , 20051 ) 



However, since standard regression methods such as GP are designed to handle Gaussian noise, they may not 
be suited for discovering causality in the non-Gaussian additive noise formulation. To cope with this problem, 
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a novel regression method called HSIC regression (HSICR) has been introduced recently ( Mooii et al. , 2009). 
HSICR learns a function so that the dependence between inputs and residuals is directly minimized based on 
HSIC. Since HSICR does not impose any parametric assumption on the distribution of additive noise, it is suited 
for non-linear n on-Gaussian causa l inference. Indeed, HSICR was shown to outperform the GP-based method 
in experiments ( Mooii et al. , 20091) . 

However, HSICR still has limitations for its practical use. The first weakness of HSICR is that the kernel 
width of HSIC needs to be dete rmined manually . Sinc e the choice of the kernel width heavily affects the sensitivity 
of the independence measure ( Fukumizu et al , 20091) . lack of systematic model selection strategies is critical in 
causal inference. Se tting the kernel width to th e median distance between sample points is a popular heuristic 
in kernel methods (jScholkopf and Smola . 2002 ). but this does not always perform well in practice. Another 
limitation of HSICR is that the kernel width of the regression model is fixed to the same value as HSIC. This 
crucially limits the flexibility of function approximation in HSICR. 

To overcome the above weaknesses, we propose an alternative regression method called least-squares inde- 
pendence regression (LSIR). As HSICR, LSIR also learns a function so that the dependence between inputs and 
residuals is directly minimized. However, a difference is t hat, instead of HSI C, LSIR adopts an independence 
criterion called least-squares mutual information (LSMI) ( Suzuki et al. , 20091 ). which is a consistent estimator 
of the squared-loss mutual information (SMI) with the optimal convergence rate. An advantage of LSIR over 
HSICR is that tuning parameters such as the kernel width and the regularization parameter can be naturally 
optimized through cross-validation (CV) with respect to the LSMI criterion. 

Furthermore, we propose to determine the kernel width of the regression model based on CV with respect 
to SMI itself. Thus, the kernel width of the regression model is determined independent of that in the indepen- 
dence measure. This allows LSIR to have higher flexibility in non-linear causal inference than HSICR. Through 
experiments with real-world datasets, we demonstrate the superiority of LSIR. 



2 Dependence Minimizing Regression by LSIR 

In this section, we formulate the problem of dependence minimizing regression and propose a novel regression 
method, least-squares independence regression (LSIR). 

2.1 Problem Formulation 



Supp ose random variables 1st and Y G M arc connected by the following additive noise model (jHover et al. 
20091) : 

Y = f(X) + E, 

where / : M — > K is some non-linear function and E G K is a zero- mean random variable independent of X. The 
goal of dependence minimizing regression is, from i.i.d. paired samples {{xi, j/i)}f =1 , to obtain a function / such 
that input X and estimated additive noise E = Y — f(X) are independent. 
Let us employ a linear model for dependence minimizing regression: 



(1) 



i=i 



where m is the number of basis functions, (3 = (j3\, . . . , /3 m ) T are regression parameters, T denotes the transpose, 
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and ip{x) = (ipi(x), . . . , ip m (x)) T are basis functions. We use the Gaussian basis function in our experiments: 



ipi (x) = exp 



{x-cif 
2r 2 



where q is the Gaussian center chosen randomly from {xi}™ =l without overlap and r is the kernel width. 
In dependence minimization regression, we learn the regression parameter f3 as 



mm 



I(X 7 E) + j(3 T (3 



where I{X, E) is some measure of independence between X and E, and 7 > is the regularization parameter 
for avoiding overfitting. 

In this paper, we use the squared-loss mutual information (SMI) ( Suzuki et ai , 2009() as our independence 
measure: 



sm(x,E) 



1 



p{x, e) 



1 I p(x)p(e)dxde. 



% p(x)p(e) 

SMI(X, E) is the Pearson divergence ( Pearson , 19001 ) from p(x,e) to p(x)p(e), and it vanishes if and only if 
p(x,e) agrees with p(x)p( e), i.e., X and E are independent. Note that ordinary mutual information (MI) 
(|Cover and Thomasl . l2006h . 



MI(X, E) 



p(x, e) log — — — ^dxde, 



(2) 



' P(x)p(e) 

corresponds to the Kullback-Leibler divergence ( Kullback and Leibleii Il951 ) from p(x,e) and p(x)p(e), and it 
can also be used as an independence measure. Nevertheless, we adhere to using SMI since it allows us to obtain 
an analytic-form estimator, as explained below. 



2.2 Estimation of Squared-Loss Mutual Information 

SMI cannot be directly computed since it contains unknown densities p( x,e), p(x), and p(e) . Here, we briefly 
review an SMI estimator called least-squares mutual infor mation (LSMI) ( Suzuki et aZl l2009f ). 

Since density estimation is known to b e a hard problem dVapnikl . 19981 ) , avoiding density estimation is critical 
for obtaining better SMI approximators (|Kraskov et al. |,|2004|). A key idea of LSMI is to directly estimate the 
density ratio: 

_ p{x,e) 
r(x 7 e) — . , 
p[x)p[e) 

without going through density estimation of p(x, e), p(x), and pie). 

In LSMI, the density ratio function r(x,e) is directly modeled by the following linear model: 

b 

r a (x,e) = y^q;y;(a;,e) = a T (p(x,e), 



1=1 



where b is the number of basis functions, a — {ax , . . . , o?b) T are parameters, and <p(x, e) — (ipi (x, e) 
are basis functions. We use the Gaussian basis function: 



ipi(x,e) = exp 



(x - ui) 2 + (e-vif 
2a 2 



(3) 

,(f b (x,e)) T 
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where (it;, vi) is the Gaussian center chosen randomly from {(xi, e'j)}^_ 1 without replacement, and a is the kernel 
width. 

The parameter a in the model r a (x,e) is learned so that the following squared error Jq(cx) is minimized: 
Ja(<x) = g / / ( r a( s .e) - r(a;,e)) 2 p(a;)p(e)dxde 



r a (a;, e)p(x)p(e)da;de 



r a (a;, e)p(x, e)dxde + C, 



where C is a constant independent of a and therefore can be safely ignored. Let us denote the first two terms 
by J (a): 



J(a) — Jq{ol) — C — —a Ha — h T a, 



(4) 



where 



H 

h 



<p(x, e)<p(x, e) T p(x)p(e)dxde, 
ip(x, e)p(x, e)dxde. 



Approximating the expectations in H and h by empirical averages, we obtain the following optimization problem: 

1 



a = argmm 



-a 1 Ha — no, + \a? a 



where a regularization term XaJ a. is included for avoiding overhtting, and 



1 ™ 

H = ^X! ( P{xi ) ej)(p(x i ,e j ) T , 
1 ™ 

h = - y2ip(xi,ei). 



Differentiating the above objective function with respect to a. and equating it to zero, we can obtain an analytic- 
form solution: 



a = (H + Ub)~% 



(5) 



where lb denotes the 6-dimensional identity m atrix. It was shown that LSMI is consistent under mild assumptions 
and it achieves the optimal convergence rate (IKanamori et al. 

Given a density ratio estimator r = ra, SMI can be simply approximated as follows ( Suzuki and SugivamaL 

20101) : 



. _ i ™ i n i 



2n 2 

h 1 a a Ha . 

2 2 



(6) 



Input: {(xi,ei)}f =1 , {cr*}?^, and {\j}] =1 
Output: LSMI parameter a 

Compute CV score for {cr^ }f =1 and {Aj}j =1 by Eq.©; 
Choose a and A that minimize the CV score; 
Compute a by Eq.© with a and A; 

Figure 1: Pseudo code of the LSMI algorithm with CV. 



2.3 Model Selection in LSMI 

LSMI contains three tuning parameters: the number of basis functions b, the kernel width a, and the regular- 
ization parameter A. In our experiments, we fix b — min(200,n), and choose a and A by cross-validation (CV) 
with grid search as follows. First, the samples Z = {zi \ Zi = (xi,ei)}f =1 are divided into K disjoint subsets 
{Zk}^ =1 of (approximately) the same size (we set K = 2 in experiments). Then, an estimator S.z k is obtained 
using Z\Zk (i.e., without Zf.), and the approximation error for the hold-out samples Zk is computed as 

J z k =2 az * Zk aZk ~ z * ° LZk ' 
where, for \Zk\ being the number of samples in the subset Zk, 

Hz * = TTv2 v( a; 'e)v(a;,e) T , 

^ Zk = Jz~\ ^ V^e). 

1 fe| (x,e)ez h 

This procedure is repeated for k — 1, . . . , K, and its average j( K ~ cv ) is outputted as 

/c=i 

We compute j( K ~ cw ) for all model candidates (the kernel width a and the regularization parameter A in the 
current setup) , and choose the model that minimizes 

j(if-CV)_ Note that j(K-CV) 

is an almost unbiased estimator 

of the objective function (jj), where the al most-ness comes from t he fac t that the number of samples is reduced 



in the CV procedure due to data splitting (jScholkopf and Smolal [2002) 



The LSMI algorithm is summarized in Figure [TJ 
2.4 Least-Squares Independence Regression 

Given the SMI estimator (|6]) , our next task is to learn the parameter f3 in the regression model (pj as 



(3 = argmin 
P 



SMl(X,E) + ^P T P 



We call this method least-squares independence regression (LSIR). 
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For regression parameter learning, we simply employ a gradient descent method: 

wher e r\ is a step size w hich may be chosen in practice by some approximate line search method such as Armijo's 
rule ( Patriksson , 19991) . 

The partial derivative of SMI(X, E) with respect to (3 can be approximately expressed as 



dSMl{X.E) J^^dh t 1 A ^^..dHi V 



where 



dhi 




dp 


n 1 

% 


dHi,i> 


i 


dp 




dipi(x,e) 




dp 


~~ ~2 



-qq W ( x j . e i) + <Pi fa , ej) — 

1 

—ft 



In the above derivation, we ignored the dependence of P on ei. It is possible to exactly compute the derivative 
in principle, but we use this approximated expression since it is computationally efficient. 

We assumed that the mean of the noise E is zero. Taking into account this, we modify the final regressor as 



1 n 

fix) = f${x) + -^2(vi~ fp(%i) 



n 

i=i 



2.5 Model Selection in LSIR 

LSIR contains three tuning parameters — the number of basis functions m, the kernel width r, and the regu- 
larization parameter 7. In our experiments, we fix m = min(200,n), and choose t and 7 by CV with grid 
search as follows. First, the samples Z = {zi | z% — (x*, ej)}™ =1 are divided into T disjoint subsets {Z t }J =1 of 
(approximately) the same size (we set T — 2 in experiments). Then, an estimator Pz t is obtained using Z\Z t 
(i.e., without Z t ), and the independence criterion for the hold-out samples Z t is computed as 



ffr-cv) n - 1 - 1 

I z t = h z t <*z t - -a Zt H Zt OLz t - 2- 

This procedure is repeated for t = 1, . . . , T, and its average /( T ~ cv ) is computed as 

? t-cv )= l^-ov). (9 ) 



T 



We compute cv ) for all model candidates (the kernel width r and the regularization parameter 7 in the 
current setup), and choose the model that minimizes 

The LSIR algorithm is summarized in Figure [2] A MATLAB® implementation of LSIR is available from 



6 



Input: {(xi,j/i)}?=i, {jOLii and {7i>l=i 
Output: LSIR parameter /3 

Compute CV score for all {Ti}f =1 and {^j}j =1 by Eq.©; 
Choose t and 7 that minimize the CV score; 
Compute (3 by gradient descent ([5} with f and 7; 

Figure 2: Pseudo code of the LSIR algorithm with CV. 
'http : / / sugiyama-www . cs . titech . ac . jp/~yamada/lsir . html'. 



3 Causal Direction Inference by LSIR 

In the previous section, we gave a dependenc e minimizing regres sion method, LSIR, that is equipped with CV 



regr 

for model selection. In this section, following iHover et al\ (|2009l ). we explain how LSIR can be used for causal 
direction inference. 

Our final goal is, given i.i.d. paired samples {(#,-, to determine whether X causes Y or vice versa. 

To this end, we test whether the causal model Y — fy(X) + Ey or the alternative model X = fx(Y) + Ex 
fits the data well, where the goodness of fit is measured by independence between inputs and residuals (i.e., 
estimated noise). Independ ence of inputs and residuals may be decided in practice by the permutation test 



( Efron and Tibshiranil 119931) 



More specifically, we first run LSIR for {(xi, as usual, and obtain a regression function /. This 

procedure also provides an SMI estimate for {(xi, ei) | ei = y,— f(xi)}f =l . Next, we randomly permute the pairs 
of input and residual {(xi,ei)}f =1 as {(xi, e^/j))}^, where «(•) is a randomly generated permutation function. 
Note that the permuted pairs of samples are independent of each other since the random permutation breaks the 
dependency between X and E (if exists). Then we compute SMI estimates for the permuted data {(xi, e K (j))}" =1 
by LSMI. This random permutation process is repeated many times (in experiments, the number of repetitions is 
set to 1000), and the distribution of SMI estimates under the null- hypothesis (i.e., independence) is constructed. 
Finally, the p-value is approximated by evaluating the relative ranking of the SMI estimate computed from the 
original input-residual data over the distribution of SMI estimates for randomly permuted data. 

In order to decide the causal direction, we compute the p- values px^Y and px^Y for both directions X — » Y 
(i.e., X causes Y) and X <— Y (i.e., Y causes X). For a given significance level 8, we determine the causal 
direction as follows. 

• If px^rY > 8 and px^-Y < 8, the model X — > Y is chosen. 

• If px^-Y > 5 and px^Y < 5, the model X <— Y is selected. 

• If px^rY ,Px^-Y < 8, then we conclude that there is no causal relation between X and Y. 

• If px->Y ,Px<t~y > 8, perhaps our modeling assumption is not correct. 

When we have prior knowledge that there exists a causal relation between X and Y but their the causal 
direction is unknown, we may simply compare the values of px^Y and px-t-Y as follows: 

• If px^rY > Px^y, we conclude that X causes Y. 
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• Otherwise, we conclude that Y causes X. 

This simplified procedure allows us to avoid the computational expensive permutation process. 

In our preliminary experiments, we empirically observed that SMI estimates obtained by LSIR tend to be 
affected by the basis function choice in LSIR. To mitigate this problem, we run LSIR and compute an SMI 
estimate 5 times by randomly changing basis functions. Then the regression function that gives the smallest 
SMI estimate among 5 repetitions is selected and the permutation test is performed for that regression function. 



4 Existing Method: HSIC Regression 



In this section, we first review the Hilbert- Schmidt independence criterion (H SIC) dGretton et al , 2005) and 
point out its potential weaknesses. Then, we review HSIC regression (HSICR) dMooii et alL\200$) . 



4.1 HSIC 



The Hilbert- Schmidt independence criterion (HSIC) ( Gretton et alY 20051) i s a state-of-the-ar t meas ure of sta- 
tistical independence based on characteristic functions (see also iFeuerverger , 1993t Kankainenl . Il995h . Here, we 
review the definition of HSIC and explain its basic properties. 

Let J 7 be a reproducing kernel Hilbert space (RKHS) with reproducing kernel K[x, x') (|Aronszainl .[l950). and 
Q be another RKHS with reproducing kernel L(e, e'). Let C be a cross- covariance operator from Q to i.e., 
for all / G J- and g G Q, 



(f,Cg)? = 



f( x )~ / f{x)p(x)dx 



9(e)- / g(e)p(e)de 



where (•, denotes the inner product in T . Thus, C can be expressed as 



C 



K(-,x) — / K(-,x)p(x)dx 



L(-,e)- L(-,e)p(e)de 



\p(x, e)dxde, 



\p(x, e)dxde, 



where '(8)' denotes the tensor product, and we used the reproducing properties: 

f(x) = (f,K(-,x))j? and g(e) = (g, £(•, e))g. 

The cross-covariance operator is a generalization o f the cross-covariance matrix between random vectors. 
When T and Q are universal RKHSs (jSteinwartl . 120011) defined on compact domains X and £, respectively, the 
largest singular value of C is zero if and only if x and e are independent. Gaussian RKHSs are examples of the 
universal RKHS. 

HSIC is defined as the the squared Hilbert- Schmidt norm (the sum of the squared singular values) of the 
cross-covariance operator C: 



HSIC 



K (x, x')L(e, e')p{x, e)p(x' ', e')da;deda;de' 



-2 



K (x, x')p(x)p{x')dxdx' 
K(x,x')p(x')dx' 



L(e,e')p(e)p(e')dede' 



L(e,e')p(e')de' 



p(x, e)dxde. 
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The above expression allows one to immediately obtain an empirical estimator — with the i.i.d. samples Z = 
{(%k, Zk)}k=i following p(x, e), a consistent estimator of HSIC is given as 

^ n 1 71 

HSIC(X,i?) := — K(xi,Xi')L(ei,eji) + — ^ 2J K(xi,Xi>)L(ej,ef) 



2 ^ 

— K(x i ,x k )L(e j ,e k ) 



ij,k=l 



MKTLT), 



(10) 



where 



K iti > = K(xi,Xi/), Ljj> = i(ej,e,/), and r = J„ 



-1 1 



J„ denotes the n-dimensional identity matrix, and l n denotes the n-dimensional vector with all ones. 



HSIC depends on the choice of the universal RKHSs T and Q. In the original HSIC paper (jGretton el al 



2005), the Gaussian RKHS with width set to the median distance betw een samples was used, which is a popular 



heuristic in the kernel method community (jScholkopf and Smolal . 120021 ) . However, to the best of our knowledge, 
there is no strong theoretical justification for this heuristic. On the other hand, the LSMI method is equipped 
with cross-validation, and thus all the tuning parameters such as the Gaussian width and the regularization 
parameter can be optimized in an objective and systematic way. This is an advantage of LSMI over HSIC. 



4.2 HSIC Regression 

In HSIC regression (HSICR) (|Mooii et Qi.l . l2009h . the following linear model is employed: 



fe(x) = J2 ei ^ = eT ^> 



(11) 



where 8 = (9\, 6 n ) T are regression parameters and 4>{x) = (<tn (x) <t>„ (x)) T are basis functions. iMooii et al 

(2009) proposed to use the Gaussian basis function: 



(j>l (x) = exp 



{x - Xlf 
2p 2 



where the kernel width p is set to the median distance between points in the samples: 

p = 2- 1 / 2 median({||a ;i - ^||}^ =1 ). 
Given the HSIC estimator dTU|) . the parameter in the regression model (fTTj) is obtained by 



6 = argmin 
e 



HSic(x,y - j e (x)) + ^e T d 



(12) 



where £ > is the regularization parameter for avoiding overfitting. 
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In the HSIC estimator, the Gaussian kernels, 



K(x, x') = exp ( ~^-z g - ) and L(e, e') = exp ( — ^ 



e 



are used and their kernel widths are set to the median distance between points in the samples: 

cr x = 2~ 1/2 median({||iE l - 

a c = 2- 1 /2 mcdian({ || ei _ ej .||}^ =i) ^ 



The o ptimization problem (|12p can be efficiently solved by using the L-BFGS quasi-Newton method ( Liu and Nocedail . 



19891 ) or gradient descent. 



Then, the final regressor is given as 

1 ™ 

f(x)=fg(x) + -Y l {Vi-fs(*i))- 



n 

i=l 



Note that, since it is not allowed to change the kernel width a c during the optimization (|12p. er c is fixed to an 
estimate obtained based on an initial rough estimate of the residuals. This fact implies that, if the estimation 
accuracy of a e is poor, the overall performance of HSICR will be degraded. On the other hand, the LSIR 
method is equipped with cross-validation, and thus all the tuning parameters can be optimized in an objective 
and systematic way. This is a significant advantage of LSIR over HSICR. 

5 Experiments 

In this section, we first illustrate the behavior of LSIR using a toy example, and then we evaluate the performance 
of LSIR using benchmark datasets and real-world gene expression data. 

5.1 Illustrative Examples 

Let us consider the following additive noise model: 

Y = X 3 + E, 

where X is subject to the uniform distribution on (— 1, 1) and E is subject to the exponential distribution with 
rate parameter 1 (and its mean is adjusted to have mean zero). We drew 300 paired samples of X and Y following 
the above generative model (see Figure [3]) , where the ground truth is that X and E are independent of each 
other. Thus, the null- hypothesis should be accepted (i.e., the p- values should be large). 

Figure [3] depicts the regressor obtained by LSIR, giving a good approximation to the true function. We re- 
peated the experiment 1000 times with the random seed changed. For the significance level 5%, LSIR successfully 
accept ed the null-h y pothe sis 992 times out of 1000 runs. 

As iMooii et all (|2009l ) pointed out, beyond the fact that the p- values frequently exceed the pre-specified 



significance level, it is important to have a wide margin beyond the significance level in order to cope with, 



e.g., multiple variable cases. Figure 4(a) depicts the histogram of px^Y obtained by LSIR over 1000 runs. The 
plot shows that LSIR tends to produce much larger p-values than the significance level; the mean and standard 
deviation of the p-values over 1000 runs are 0.6114 and 0.2327, respectively. 
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Figure 3: Illustrative example. The solid line denotes the true function, the circles denote samples, and the 
dashed line denotes the regressor obtained by LSIR. 



Next, we consider the backward case where the roles of X and Y were swapped. In this case, the ground 
truth is that the input and the residual are dependent (see Figure [3]). Therefore, the null- hypothesis should be 
rejected (i.e., the p-values should be small). Figure |4(b)| shows the histogram of px^-Y obtained by LSIR over 
1000 runs. LSIR rejected the null-hypothesis 989 times out of 1000 runs; the mean and standard deviation of 
the p-values over 1000 runs are 0.0035 and 0.0094, respectively. 



Figure 4(c) depicts the p-values for both directions in a trial-wise manner. The graph shows that LSIR 
perfectly estimates the correct causal direction (i.e., px->Y > Px^y), and the margin between px^Y and px^-Y 
seems to be clear (i.e., most of the points are clearly below the diagonal line). This illustrates the usefulness of 
LSIR in causal direction inference. 

Finally, we investigate the values of independence measure SMI, which are plotted in Figure |4(d)| again in 
a trial-wise manner. The graph implies that the values of SMI may be simply used for determining the causal 
direction, instead of the p- values. Indeed, the correct causal direction (i.e., SMIx^y < SMIx<-y) can be found 
999 times out of 1000 trials by this simplified method. This would be a practically useful heuristic since we can 
avoid performing the computationally intensive permutation test. 

5.2 Benchmark Datasets 

Next, we evaluate the perfo rmance of LSIR on the 'Cause-Effect Pairs' task in the NIPS 2008 Causality Com- 
petition ( Mooii et all 20081) . The task contains 8 datasets (see Figure [5j), each has two statistically dependent 



random variables possessing inherent causal relationship. The goal is to identify the causal direction from the 
observational data. Since these datasets consist of real-world samples, our modeling assumption may be only 
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(a) Histogram of px_,y obtained by 
LSIR over 1000 runs. The ground truth 
is to accept the null-hypothesis (thus the 
p- values should be large). 
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(b) Histograms of px<—Y obtained by 
LSIR over 1000 runs. The ground truth 
is to reject the null-hypothesis (thus the 
p- values should be small). 
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(c) Comparison of p-values for both di- 
rections (px->y vs - Px<— y)' Points be- 
low the diagonal line indicate successful 
trials. 




0.05 
SMI-hat x ^ y 



(d) Comparison of values of independence 
measures for both directions (SMIjc— >Y" 
vs. SMIx<— y)- Points above the diagonal 
line indicate successful trials. 



Figure 4: LSIR performance statistics in illustrative example. 



approximately satisfied. Thus, identifying causal directions in these datasets would be highly challenging 
The p-values and the independence scores for each dataset and each direction are sum marized in Table 



The values of HSICR, which were also computed by the permutation test, were taken from (jMooii et aZ.1 . 1200 



but the p-values were rounded off to three decimal places to be consistent with the results of LSIR. When the 
p-values of both directions are less than 10 -3 , we concluded that the causal direction cannot be determined 
(indicated by '?'). 

Table [T] shows that LSIR successfully found the correct causal direction for 7 out of 8 cases, while HSICR 
gave the correct decision only for 5 out of 8 cases. This implies that LSIR compares favorably with HSICR. 

The values of independence measures described in Table [1] show that merely comparing the values of SMI is 
again sufficient for deciding the correct causal direction in LSIR (see the estimated causal directions described 
in the brackets). Actually, this heuristic also allows us to correctly identify the causal direction in Dataset 8. 
On the other hand, in HSICR, this convenient heuristic is not as useful as in the case of LSIR. 
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Figure 5: Datasets of the 1 Cause-Effect Pairs' task in the NIPS 2008 Causality Competition ( Mooii et al , 2008) 



5.3 Gene Function Regulations 

Finally, we apply our proposed LSIR method to the real-world biological datasets, which contain known causal 
relationships about gene function regulations from transcription factors to gene expressions. 

Causal prediction is biologically and medically important because it gives us a clue for disease-causing genes 
or drug-target genes. Transcription factors regulate expression levels of their relating genes. In other words, 
when the expression level of transcription factor genes is high, genes regulated by the transcription factor become 
highly expressed or suppressed. 



In this experiment, we select 10 well-known gene regulation relationships of E. coli (jFaith et al\ , 120071) . where 
each data contains expression levels of the genes over 445 different environments (i.e., 445 samples, see Figure [B]) 

The experimental results are summarized in Table [21 showing that LSIR successfully found the correct causal 
direction for 7 out of 10 cases, while HSICR gave the correct decision only for 4 out of 10 cases. Moreover, the 
causal direction can be efficiently chosen 9 out of 10 cases just by comparing the values of SMI. 



6 Conclusions 

In this paper, we proposed a new method of dependence minimization regression called least-squares independence 
regression (LSIR). LSIR adopts the squared-loss mutual information as an independence measure, and it is 
estimated by the method of least-squares mutual information (LSMI). Since LSMI provides an analytic-form 
solution, we can explicitly compute the gradient of the LSMI estimator with respect to regression parameters. A 
notable ad vantage of the prop osed LSIR method over the state-of-the-art method of dependence minimization 



l e p 

regression (jMooii et al\ . [20091 ) is that LSIR is equipped with a natural cross-validation procedure, allowing us 



to objectively optimize tuning parameters such as the kernel width and the regularization parameter in a data- 
dependent fashion. We experimentally showed that LSIR is promising in real-world causal direction inference. 
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Table 1: Results for the 1 Cause-Effect Pairs^ task in the NIPS 2008 Causality Competition (jMooii et all l2008h . 
When the p- values of both directions are less than 10~ 3 , we concluded that the causal direction cannot be 
determined (indicated by '?'). Estimated directions in the brackets are determined based on comparing the 
values of SMI or HSIC. 

(a) LSIR 



Dataset 


p- values 


SMI 


Direction 




X ->• Y 


X <- Y 


X -> Y 


X <- Y 


Estimated 


Truth 


1 


0.031 


< ID" 3 


0.0057 


0.0265 


->■ (->■) 


— > 


2 


0.004 


< lO- 3 


0.0182 


0.0301 


->• (->) 




3 


0.099 


0.009 


0.0090 


0.0147 


-> (->) 


-> 


4 


0.102 


0.173 


0.0075 


0.0051 


<- («— ) 


4— 


5 


< 10~ 3 


0.012 


0.0234 


0.0108 


<- (<— ) 


^~ 


6 


0.058 


0.001 


0.0079 


0.0154 


-> (->) 


-> 
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0.009 


0.018 


0.0121 


0.0110 


(<-) 




8 


< to" 3 


< lO" 3 


0.0149 


0.0244 


? (->) 


— ► 


(b) HSICR 


Dataset 


p- values 


HSIC 


Direction 




X ->• Y 


X <- Y 


X -> Y 


X <- Y 
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Truth 


1 


0.290 


< lO" 3 


0.0012 


0.0060 




— > 


2 
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0.014 
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0.0021 






3 


0.045 


0.003 


0.0019 


0.0026 






4 


0.376 


0.012 


0.0011 


0.0023 






5 


< 10~ 3 


0.160 


0.0028 


0.0005 


-s— (<— ) 




6 


< 10-3 


< lO" 3 


0.0032 


0.0026 


? (-<-) 




7 


< 10-3 


0.272 


0.0021 


0.0005 


-s— («— ) 




8 


< to" 3 


< lO" 3 


0.0015 


0.0017 




— > 
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